Although single-case researchers are not accustomed to analyzing data statistically, standards for
research and accountability from government and other funding agents are creating pressure for more
objective, reliable data. In addition, “evidence-based interventions” movements in special education,
clinical psychology, and school psychology imply reliable data summaries. Within special education,
two heavily debated single-case research (SCR) statistical indices are “percentage of non-overlapping
data” (PND) and the regression effect size, R². This article proposes a new index, PAND (the
“percentage of all non-overlapping data”), to remedy deficiencies of both PND and R². PAND is closely
related to the established effect size, Pearson’s Phi, the “fourfold point correlation coefficient.” The
PAND/Phi procedure is demonstrated and applied to 75 published multiple baseline designs to answer
questions about typical effect sizes, relationships with PND and R², statistical power, and time
efficiency. Confidence intervals and p values for Phi also are demonstrated. The findings are that
PAND/Phi and PND correlate equally well to R². However, only PAND/Phi could show adequate power for
most of the multiple baseline designs sampled. The findings suggest that PAND/Phi may meet the
requirement for a useful effect size for multiple baseline and other longer designs in SCR.


Single-case researchers traditionally have relied on visual 
analysis of graphs for judging intervention success. Despite 
the well-documented unreliability of visual judgments (Bros- 
sart, Parker, Olson, & Lakshmi, 2006; DeProspero & Cohen, 
1979; Harbst, Ottenbacher, & Harris, 1991; Ottenbacher, 1990; 
Park, Marascuilo, & Gaylord-Ross, 1990), most published 
single-case research (SCR) continues to rely on visual judg- 
ments, assisted by comparisons of phase means, medians, or 
percentages. Visual analysis also commonly includes judging 
the amount of data overlap between phases, which helps cap- 
ture the important concept of data dispersion or variability 
around a center. A recent study of 124 published SCR datasets 
(Parker et al., 2005) found statistical analyses in only 11%, 
which is comparable to the 10% rate cited in earlier studies of 
10 and 25 years ago (Busk & Marascuilo, 1992; Kratochwill
& Brody, 1978). Thus, in acceptance of statistical analysis, the 
SCR field has changed little over recent decades. 

For the 75 multiple baseline designs (MBDs) sampled in 
this study, most authors (87%) relied solely on visual analysis, 
with phase percentages or means calculated, but no tests of 
differences between these indices. Data variability around the 
means and percents (e.g., standard deviations) was not pre- 
sented, nor were reliabilities (standard error or confidence in- 
tervals [CIs]). The summary statistics served as nonessential 
additions to the primary visual analysis. Of the eight designs 
(11%) that were evaluated statistically, six used the Student t
test, one a Friedman two-way nonparametric test, and one a
repeated measures ANOVA. Several of these authors referred 
to visually apparent data overlap between phases, but without 
quantifying the overlap. 

Recent changes within the fields of education and psy- 
chology bolster the arguments for more objective and reliable 
results. These are the movements for evidence-based
interventions, practices, or treatments in fields such as special
education (Odom et al., 2005), school psychology (Kratochwill
& Stoiber, 2000), and clinical psychology (Chambless &
Ollendick, 2001). In school and clinical psychology, these
movements are setting standards for reporting intervention 
efficacy, including objective and statistically reliable sum- 
maries that can be interpreted across the field and that permit 
comparisons between studies. The field of special education 
also is setting design standards, but has been slower to accept 
statistical summaries. 

The evidence-based movements in education and psy- 
chology were accelerated (or spawned) by the federal legis- 
lation No Child Left Behind (NCLB; 2001) and the Education 
Sciences Reform Act (ESRA; 2002). These laws are implemented
through the new Institute of Education Sciences (IES;
Whitehurst, 2004), which has set higher standards for funded
educational research, including SCR. Corresponding policies also are
reflected in statements such as the National Research Council’s 
Scientific Research in Education (Shavelson & Towne, 2002). 

The momentum for more objective and reliable SCR results
also is accelerated by the need to have SCR studies
included in meta-analyses and other non-SCR publications
(Horner, Carr, Halle, McGee, Odom, & Wolery, 2005).
Meta-analyses of single-case studies generally are separate from
those of group research, and results from the two methodologies
may not agree (Forness, 2001). This inconsistency may be
because in the SCR field, “the synthesis of single-participant 
studies remains a controversial topic” (Forness, 2001, p. 190). 
The meta-analyses of special education research summarized 
by Forness failed to use standard effect sizes, instead em- 
ploying methods that discard much of the data. Mostert (2001) 
evaluated all special education meta-analyses to that date 
against minimum standards and found most to be deficient. 
One missing critical piece of information was “accounting
for the amount of total variance explained by the treatment
effect. . . . The higher the proportion of variance accounted
for, the stronger the evidence for the efficacy of the treatment
or intervention” (Mostert, 2001, p. 215). The summaries of
treatment effect used by most SCR meta-analyses do not meet this
minimum criterion.

The PND Versus R² Debate

Changes in education and psychology, especially the need to
include SCR data in meta-analyses, have prompted the
development of methods for calculating SCR effect sizes. In this
context, a spirited debate over the best SCR effect size
occurred over two decades, pitting R² against percent of
non-overlapping data (PND; Allison & Gorman, 1993; Allison &
Gorman, 1994; Scruggs, Mastropieri, & Casto, 1987; Scruggs
& Mastropieri, 1998; Scruggs & Mastropieri, 2001; White,
1987). R², the regression effect size most frequently used in
all social science research (Kirk, 1996), is championed by
Allison and colleagues. It is easily converted to a second
popular effect size, Cohen’s (1988) d, the standardized mean
difference (Rosenthal, 1991). For an AB design, R² may be
translated to the following: (a) “proportion of a client’s score
variance explained by phase differences,” (b) “reduction in
uncertainty (percent increase in prediction ability) due to
phase differences,” (c) “the percent of the scores of one phase
exceeded by the upper half of the scores of the other phase,”
and (d) “percent of non-overlap of client scores between
phases” (Cohen, 1988, p. 22). Thus, the R² effect size shares
some meaning with PND.

The competing index of effect was PND, championed by 
Scruggs, Mastropieri, and Casto (1987). PND is the percent- 
age of Phase B data that are more extreme (in an improvement 
direction) than the single most extreme Phase A data point. 
PND can be hand-calculated from a printed graph (Scruggs 
& Mastropieri, 1994). Scruggs and Mastropieri (1994) offered 
general interpretational guidelines of PND > 70 for effective
interventions, 50 < PND < 70 for interventions of questionable
effectiveness, and PND < 50 for interventions with no
observed effect. Gauged by acceptance in the field, PND won
this debate. Of 15 single-case meta-analyses found by Scruggs
and Mastropieri (2001), two thirds used PND, and only two
used regression. Further supporting PND is its endorsement
by recognized text authors Kazdin (1982) and Tawney and
Gast (1984), as well as some meta-analysts (Kavale, Mathur,
Forness, Quinn, & Rutherford, 2000).
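
The PND definition and the interpretational guidelines above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure used by the PND authors: the function names `pnd` and `interpret_pnd` are hypothetical, improvement is assumed to mean higher scores, the sample scores are made up for illustration, and the boundary values of exactly 50 and 70 (which the guidelines leave unassigned) are placed in the "questionable" band by assumption.

```python
from typing import List


def pnd(baseline: List[float], intervention: List[float]) -> float:
    """Percentage of Phase B points above the single highest Phase A point.

    Assumes that improvement means an increase in scores.
    """
    ceiling = max(baseline)  # the one Phase A point that PND relies on
    exceeding = sum(1 for score in intervention if score > ceiling)
    return 100.0 * exceeding / len(intervention)


def interpret_pnd(value: float) -> str:
    """Apply the general guidelines quoted in the text.

    Assumption: values of exactly 50 or 70 fall in the middle band.
    """
    if value > 70:
        return "effective"
    if value >= 50:
        return "questionable"
    return "no observed effect"


# Made-up illustrative series: three baseline and three intervention scores.
example_a = [20, 18, 20]
example_b = [20, 21, 23]
example_pnd = pnd(example_a, example_b)
print(round(example_pnd, 1), interpret_pnd(example_pnd))
```

Note that only one baseline score (the maximum) enters the calculation, which is the basis of the criticism that PND discards most Phase A data.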

The debate on PND versus regression highlighted their 
strengths and weaknesses. A summary from the debate articles 
indicates that PND offers at least three advantages. First is ease 
of calculation, as PND can be conducted with a pencil and ruler 
on a printed graph, and as a percentage calculation. Second is 
acceptability to visual analysts, as PND’s emphasis on over- 
lapping data reflects a key component of most visual analyses. 
The third advantage is PND’s applicability to any SCR design. 

The debate also helped define at least four limitations of 
PND, some acknowledged by its authors. First, PND is nei- 
ther an effect size nor related to an accepted effect size, so it 
needs its own interpretation guidelines. Second, PND has un- 
known reliability, as it lacks a known sampling distribution, 
so p values and confidence intervals cannot be calculated. The 
third weakness is that PND ignores all phase A data except 
for one data point, which because of its extremity, is likely the 
most unreliable. The fourth limitation is that PND lacks
sensitivity or discrimination ability as it nears 100% for very
successful interventions.

The competing regression approach (R² index)
advocated by Allison and Gorman (1993) demonstrates four major
strengths. First, it results in R² or Cohen’s d effect sizes, which
are well established within the broader research community. 
Second, it permits calculation of confidence intervals to indi- 
cate the effect size’s trustworthiness or reliability. Third, re- 
gression uses all data in both phases of SCR. Fourth, the 
regression approach can be expanded for more complex analy- 
ses (e.g., including phase trends; Bloom, Fischer, & Orme, 
2003). 

The debate also highlighted at least three limitations of 
the regression approach. The first limitation is that the para- 
metric data assumptions of normality, equal variance, and se- 
rial independence are commonly not met by SCR data. 
Second, regression analyses can be unduly influenced by ex- 
treme outlier scores. The third limitation is that expertise is 
required to conduct and interpret regression analyses and to 
judge whether data assumptions have been met. 

PAND and Phi or Phi²

This article introduces an alternative index, the “percentage
of all non-overlapping data” (PAND), and allied indices from
the same 2 x 2 table, Phi and Phi², and examines their
technical adequacy. Both Phi and Phi² are emphasized in this
article, because a recent movement favors the unsquared terms
(R and Phi), though the squared coefficient is still more
commonly published (Abelson, 1985; Rosenthal, Rosnow, & Rubin,
2000). Like PND, PAND reflects data non-overlap between
phases but differs in important respects. PAND uses all data
from both phases, avoiding the criticism leveled at PND for
wastefulness and for overemphasis on one unreliable data
point. Importantly, PAND can be translated to Pearson’s Phi
and Phi², which are both “bona fide effect sizes” (Cohen,
1988, p. 223). Phi is a Pearson R for a 2 x 2 contingency table,
so Phi² and R² are equivalent. Standing alone, PAND lacks
status, as does PND. However, Phi and Phi² have known
sampling distributions, so p values are available, statistical power
can be estimated, and CIs can be included to indicate effect
size reliability (Cohen, 1988; Fleiss, 1981). Also, Phi² can be
transformed to Cohen’s d, a recognized effect size in another
metric.
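
The article does not state which conversion to Cohen’s d it has in mind. A minimal sketch, assuming the widely used correlation-to-d conversion d = 2r / sqrt(1 - r²) applied to Phi (a formula that presumes roughly equal group sizes), might look like this; the function name `phi_to_d` is hypothetical.

```python
import math


def phi_to_d(phi: float) -> float:
    """Convert a Phi (Pearson r) value to Cohen's d.

    Assumption: the standard r-to-d conversion d = 2r / sqrt(1 - r^2),
    which presumes approximately equal group sizes. The article does
    not specify its own formula.
    """
    if not -1.0 < phi < 1.0:
        raise ValueError("phi must lie strictly between -1 and 1")
    return 2.0 * phi / math.sqrt(1.0 - phi ** 2)


# A Phi of .50 corresponds to a d a little above 1.15 under this formula.
print(round(phi_to_d(0.50), 2))
```

Because the conversion grows without bound as Phi approaches 1, very high Phi values translate to very large d values and should be read with caution.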

The data requirements for PAND are minimal — just 
those for a chi-square test with frequency data (Cohen, 1988; 
Hays, 1988) — mainly, a minimum of 20 data points (5 per cell 
of the 2 x 2 table). The parametric requirements of equal
variance and normality do not apply. The requirement of serial
independence or lack of autocorrelation has little impact on
PAND results because the tabled frequency data are unordered.

Unbalanced 2x2 tables can cause problems in calcu- 
lating valid effect sizes, but the method espoused here pro- 
duces tables that always have equal marginal proportions. The 
method of calculating overlapping data points (the minimum 
number that would have to be swapped across phases to elim- 
inate all overlap) always yields an equal number in baseline 
and intervention phases. The equality of these two values
produces a 2 x 2 table with exactly balanced marginals.

Two limitations cited for PND are not remedied by PAND. 
The first is insensitivity at the upper end of the scale. When 
there is no data overlap between Phases A and B, both PND 
and PAND award a 100% score, regardless of the distance be- 
tween the two data clusters. PAND’s second limitation is that 
it measures simple mean level shifts, not controlling for pos- 
itive baseline trend. Like PND, it does not try to adjust for 
prior rate of improvement. Any positive baseline trend must 
be considered before attempting to infer a causal link between 
intervention and behavior (Parker, Cryer, & Burns, 2006). A 
large effect size alone does not imply that change was due to 
the intervention. 

Although PAND avoids parametric data assumptions, its 
major disadvantage compared with an interval-level analysis 
is the reduced statistical power of a nominal-level tech- 
nique. However, this reduced power will be problematic only 
if it prevents reliable detection of effect sizes large enough to 
be considered important. That is an empirical question that 
can be answered only from examining actual published data- 
sets to determine the effect sizes that typically result. This 
study does just that, comparing statistical power of Phi and
Phi² with that of R and R² for a sample of 75 published SCR
datasets.

Given the total sample size required, the PAND/Phi
procedure is not recommended for short, single-baseline designs
of fewer than 20 or 25 data points. This article shows that
samples of 60 to 80 are typical with MBDs, so they are well suited
for PAND analysis. The PAND/Phi procedure also will be
useful for contrasts within longer, single-series designs,
including multiphase designs, such as ABABA, although only
MBDs are demonstrated here. The focus of this article is
limited to testing a defensible effect size for one popular, strong,
and complex SCR design.

PAND calculations are demonstrated both by hand and 
from a data spreadsheet. A touted strength of PND is that it 
can be calculated by hand from a graph. However, we found 
this procedure to be inaccurate for longer designs and more 
crowded graphs, and had to resort to the data spreadsheet for 
accurate calculation of both PND and PAND. PAND calcula- 
tions are less complex than ANOVA or regression analyses, 
as PAND does not require interpretation of data assumptions 
output (equal variance, homogeneity, serial independence). 
For this study, the efficiency of conducting PAND was mon- 
itored with three experienced single-case researchers who had 
not previously used the technique. 

PAND is presented as an alternative to PND for larger 
SCR datasets, as are typical in the special education literature. 
It is recommended for local use in school or clinic, for docu- 
mentation and accountability, and for meta-analyses and other 
scholarly publications. This article applies the technique in 
detail to a single dataset and then field-tests it with 75 pub- 
lished SCR datasets. PAND is applied only to the highly re- 
garded multiple baseline design, which has been a particularly 
challenging design for the field to adequately analyze. A good 
presentation on MBD analysis is by Busk and Serlin (1992), 
who recommend calculating whether data have met paramet- 
ric assumptions, and depending on those results, then applying 
one of four alternative methods of analysis. The alternatives 
(hand calculations, individual t tests, Wilcoxon, ANOVA) pro- 
duced very different effect size magnitudes of Cohen’s d = 
3.56 to 5.98 for their sample data. To date, the MBD lacks a 
generally acceptable statistical summary. 

The purpose of this article is to demonstrate PAND as a 
generally applicable analytic technique and to field-test it with 
a reasonable sample of published data. Field-testing PAND/ 
Phi with 75 published MBDs helps answer questions that po- 
tential users of the technique would logically pose: (a) What 
Phi effect sizes are typical with published research data, and 
how are they distributed? (b) What is the reliability of these 
effect sizes? (c) How does PAND/Phi correlate with the two
alternatives, PND and ordinary least squares (OLS) regres- 
sion? (d) How much statistical power does Phi possess? and 
(e) How efficient is this new procedure? 

Method 

Demonstration Dataset 

PAND/Phi is first demonstrated with a short, fabricated MBD
dataset, created to facilitate replication. In this MBD, a
note-taking strategy seeks to improve the quality of homework by
Adam, Bob, and Carol, three students with learning disabilities
(LD). Homework is rated weekly on a 30-point scale. An
MBD graph is presented in Figure 1, with horizontal mean lines
superimposed on each phase. Summary statistics on the data
are presented in Table 1.


FIGURE 1. Multiple baseline across students (fabricated data).


Figure 1 and Table 1 show that the three data series for
Adam, Bob, and Carol are of unequal length, as well as
unequal means and variances, qualities that cause problems for
an R² analysis (ANOVA or regression). The example was
made challenging in another way: Two of the baseline phases
are too short and variable to infer a stable trend. Therefore,
trend control techniques available in R² analyses cannot be
legitimately applied.

Visual analysis of Figure 1 reflects a somewhat effective
intervention. Data clusters between Phase A and Phase B are
not widely separated, and overlap is apparent for each student
series. Overlapping data points are defined as the minimum
number that would have to be swapped across phases for
complete score separation. With a transparent ruler, we carefully
calculated overlapping data as 2 for Adam, 2 for Bob, and 2
for Carol, totaling 6, or 6/28 = 21.4% overlap. PAND is
therefore 100 - 21.4 = 78.6%. From PAND, we also can calculate
non-overlap beyond chance level (50%): 78.6 - 50 = 28.6%
beyond chance level.

For short datasets like this example, counting from the 
graph is usually accurate. For longer, more crowded datasets, 
an alternative spreadsheet sorting method will be provided. 
However, first the hand-calculation method is continued to 
demonstrate a 2 x 2 table and a Phi effect size. 

Table 2 demonstrates the creation of the 2 x 2
contingency table in two steps, from left to right. Beginning at the
left, the percentages of data points in the baseline and
intervention phases are calculated: 13/28 = 46.4%, 15/28 = 53.6%.
These percentages are entered at the bottom of their
respective columns (see Table 2, left side). Next, the proportion of
overlapping data (21.4%) is split between cells b and c: 10.7%
in each cell. These two cells represent “too high” scores in the
baseline phase (cell b) and “too low” scores in the
intervention phase (cell c). Finally, cells a and d are filled in by
subtraction: 46.4 - 10.7 = 35.7 and 53.6 - 10.7 = 42.9. Because
this table has completely balanced vertical and horizontal
marginals, a Pearson Phi effect size can be calculated as the
difference between the two cell ratios: [a / (a + c)] - [b / (b +
d)]. In this example 35.7/46.4 - 10.7/53.6 = .769 - .200 =
.569, so Phi = .57. This Phi value can be confirmed by
entering the table’s four inside numerals into either a crosstabs
statistical module or a “test of two independent proportions”
module; they will both yield .57. A proportions test module
is preferred because nearly all provide CIs for Phi. For a 90%
level of confidence, the exact bootstrap interval is .43 < .57
< .71. So we can be 90% certain that the true effect size is
somewhere between .43 and .71. For comparison only, a
repeated measures ANOVA was conducted on the original
example data. The ANOVA yielded a partial effect size for phase
(A vs. B) of R = .60 (and R² = .36). ANOVA output indicated
that our data failed to meet data assumptions of equal
variance and normality, let alone the more stringent repeated
measures assumption of circularity (Breckler, 1990; Keselman et
al., 1998).

TABLE 1. Descriptive Summary of the Fabricated Multiple Baseline Example

Statistics         Adam             Bob                         Carol
Phase scores       A: 20, 18, 20    A: 18, 17, 16, 20           A: 19, 18, 24, 22, 21, 19
                   B: 20, 21, 23    B: 19, 22, 19, 20, 20, 24   B: 30, 21, 23, 28, 32, 34
Number of scores   6                10                          12
SDs                1.63             2.32                        5.43
Phase means        A: 19.3          A: 17.8                     A: 20.5
                   B: 21.3          B: 20.7                     B: 28.0

TABLE 2. Two-Stage Construction of the 2 x 2 Table of Proportions

Left side (first step)
                   Phase
Overlap   Intervention     Baseline         Total
Higher    (cell a)         10.7 (cell b)
Lower     10.7 (cell c)    (cell d)
Total:    46.4             53.6             100

Right side (second step)
                   Phase
Overlap   Intervention     Baseline         Total
Higher    35.7 (cell a)    10.7 (cell b)    46.4
Lower     10.7 (cell c)    42.9 (cell d)    53.6
Total:    46.4             53.6             100
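
The Phi arithmetic described for Table 2 can be checked with a short Python sketch. It is offered only as an illustration: the function names are hypothetical, the cell values are the rounded percentages from Table 2, and the comparison with the textbook fourfold formula is an added cross-check rather than part of the article’s procedure.

```python
import math


def phi_ratio_difference(a: float, b: float, c: float, d: float) -> float:
    """Phi as the difference between two cell ratios: a/(a+c) - b/(b+d)."""
    return a / (a + c) - b / (b + d)


def phi_fourfold(a: float, b: float, c: float, d: float) -> float:
    """Textbook fourfold point correlation, included as a cross-check."""
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denominator


# Rounded cell percentages from the completed Table 2.
cell_a, cell_b, cell_c, cell_d = 35.7, 10.7, 10.7, 42.9

phi_simple = phi_ratio_difference(cell_a, cell_b, cell_c, cell_d)
phi_check = phi_fourfold(cell_a, cell_b, cell_c, cell_d)
print(round(phi_simple, 2), round(phi_check, 2))
```

The two functions agree here because cells b and c are equal, which balances the marginals; with unbalanced tables the simple ratio difference would no longer equal the fourfold coefficient.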

For longer and complex datasets, calculation of PAND 
from the graph may not be accurate. For those cases, an al- 
ternative spreadsheet sorting method will yield identical re- 
sults. The sorting procedure is described in detail in the 
Appendix. As a brief overview, data are entered into four tall
columns: Random, Series, Phase, and Score. Random is filled
with random numbers. Phase (coded A or B) is copied and
held in memory. Next, the dataset is sorted by Random. Then
the dataset is sorted by Score within Series (a nested sort).
Finally, the copied Phase content is pasted into a new Sorted
column. A crosstabs analysis of Phase and Sorted will yield the
same results obtained earlier by hand.
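
The sorting idea can be sketched in Python using the fabricated scores from Table 1. This is a simplified stand-in for the spreadsheet procedure, not a reproduction of the Appendix: the function names are hypothetical, higher scores are assumed to mean improvement, and tied scores are deterministically counted as overlap (Phase B placed before Phase A on ties) instead of being broken by a random pre-sort.

```python
from typing import Dict, List, Tuple

Series = Dict[str, List[float]]  # {"A": baseline scores, "B": intervention scores}


def overlap_count(series: Series) -> int:
    """Count points whose actual phase differs from their sorted position."""
    n_a = len(series["A"])
    tagged: List[Tuple[float, str]] = (
        [(score, "A") for score in series["A"]]
        + [(score, "B") for score in series["B"]]
    )
    # Sort by score; on ties put B first so ties count as overlap
    # (an assumption; the article uses a random pre-sort for ties).
    tagged.sort(key=lambda item: (item[0], 0 if item[1] == "B" else 1))
    # With complete separation, the lowest n_a scores would all be Phase A.
    expected = ["A"] * n_a + ["B"] * (len(tagged) - n_a)
    return sum(1 for (_, actual), exp in zip(tagged, expected) if actual != exp)


def pand(design: Dict[str, Series]) -> float:
    """Percentage of all non-overlapping data across every series."""
    total = sum(len(s["A"]) + len(s["B"]) for s in design.values())
    overlap = sum(overlap_count(s) for s in design.values())
    return 100.0 * (1.0 - overlap / total)


# Fabricated demonstration data from Table 1.
demo = {
    "Adam": {"A": [20, 18, 20], "B": [20, 21, 23]},
    "Bob": {"A": [18, 17, 16, 20], "B": [19, 22, 19, 20, 20, 24]},
    "Carol": {"A": [19, 18, 24, 22, 21, 19], "B": [30, 21, 23, 28, 32, 34]},
}
print({name: overlap_count(s) for name, s in demo.items()}, round(pand(demo), 1))
```

Under the tie rule assumed here, each series contributes two overlapping points, matching the hand count of 6 of 28 and a PAND of 78.6%. Mismatches always come in pairs (one misplaced point per phase), which is why the resulting 2 x 2 table has balanced marginals.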

Applying PAND to Published Data 

PAND was applied to 75 published MBDs, obtained from 49 
published articles (asterisked in References under the heading 
“Studies Sampled for Multiple Baseline Data”). In this field 
test, PAND was compared to PND and to R². Analyses were
guided by research questions about the magnitude, reliability,
intercorrelation, and statistical power of the new Phi and Phi²
indices.

Sample Data 

This study used a convenience sample of 75 multiple baseline 
designs. The MBD datasets were culled from a broader search
within ERIC and PsycLIT from the past 20 years for all SCR
designs, based on search terms such as multiple baseline, sin- 
gle case, single subject, time series, and baseline. The initial 
search revealed 104 promising MBD graphs. From those, all 
that were large and clear enough for digitizing were used. 
Only initial AB phase comparisons were analyzed in this 
study, although several of the designs included ABAB or ABC 
series. The final sample was 75 useable graphs from 49 arti- 
cles. 

Digitizing Graphs 

The digitizing software i-extractor (Linden Software, 1998) 
was used to reduce published graphs back to their original 
data. This was accomplished in four steps. First, graphs were 
scanned at a resolution of 300 dots per inch (dpi) into a com- 
puter, and the resulting JPEG picture files were opened with 
i-extractor. Second, graph axes were set to provide actual data 
values on a digital Cartesian coordinate spreadsheet. Third, 
clicking on each data point read its value into a Microsoft Ex- 
cel spreadsheet. Finally, data values were regraphed, and these 
graphs were compared with the originals from the articles. Re- 
liability was checked by reprinting graphs from the digitized 
data; resizing them to match the originals; then overlaying the 
original and the recreated graphs and holding them against a 
bright window, which permitted quick scanning for any
misplaced data points. Adjustments were required for five of the
graphs, due to human error. The 75 datasets were recreated in
the Number Cruncher Statistical Systems (NCSS; Hintze, 2004)
statistical package. Datasets were constructed with five
variable columns, for Random, Series, ABPhase, Score, and Sorted,
as mentioned earlier, and described in detail in the Appendix.

Analyses 

The 75 MBD graphs were analyzed by PND, PAND, and OLS
regression procedures (for R²). PND was calculated for each
separate series, and the results were then averaged for a total
PND score. We attempted to calculate PND directly from
printed graphs with a transparent ruler, but for several graphs,
the large number of crowded data points and their high
variability led to unreliable results. Therefore, PND visual
analysis had to be assisted by examining Score and ABPhase
columns in the data spreadsheet. PAND was calculated entirely
by the sorting procedure and crosstabs analysis, as detailed in
the Appendix, producing three scores: PAND, Phi, and Phi².
To obtain R², regression analyses with the dichotomous Phase
variable were conducted separately for each data series in a
design, and the individual values were averaged together.
Deteriorating series were given negative R² values.
Regression analyses were to serve only as a ballpark external
standard, without regard for meeting parametric data assumptions.
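
The per-series R² step can be sketched as follows. With a single dichotomous predictor, the regression R² equals the squared Pearson correlation between the phase dummy and the scores, so no regression library is needed. The function names are hypothetical, improvement is assumed to mean higher scores, and the example series is the fabricated Adam data from Table 1 rather than any of the 75 published datasets.

```python
import math
from typing import List, Tuple


def signed_r_squared(baseline: List[float], intervention: List[float]) -> float:
    """R^2 from regressing Score on a 0/1 Phase dummy, signed by direction.

    A negative sign marks a deteriorating series (Phase B mean below
    Phase A mean), assuming that higher scores mean improvement.
    """
    scores = list(baseline) + list(intervention)
    phase = [0.0] * len(baseline) + [1.0] * len(intervention)
    n = len(scores)
    mean_x = sum(phase) / n
    mean_y = sum(scores) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(phase, scores))
    sxx = sum((x - mean_x) ** 2 for x in phase)
    syy = sum((y - mean_y) ** 2 for y in scores)
    if sxx == 0 or syy == 0:
        return 0.0  # no variation, so no explainable variance
    r = sxy / math.sqrt(sxx * syy)
    return math.copysign(r * r, r)


def design_r_squared(series: List[Tuple[List[float], List[float]]]) -> float:
    """Average the signed per-series values into one design-level score."""
    return sum(signed_r_squared(a, b) for a, b in series) / len(series)


adam = ([20, 18, 20], [20, 21, 23])
print(round(signed_r_squared(*adam), 2))
```

Averaging signed values lets a deteriorating series pull the design-level figure down, which an unsigned R² could not do.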

Results 

Descriptive Results 

Seventy-five multiple baseline designs were selected from 49 
published articles. The articles are in the References section
under the heading “Studies Sampled for Multiple Baseline Data.”
Each MBD possessed between two and eight separate series (or 
“panels” or “baselines”), with the interquartile range (IQR), 
or middle 50% of datasets, having three or four (Mode = 3). 
Each component data series averaged 23 data points, or about 
11 for each of A and B phases. The average design possessed
70 data points total (IQR = 45 to 96). The smallest design had 
only 22 data points, and the largest had 219. 

Most authors (87%) relied solely on visual analysis, 
some also calculating phase percentages or means, but with no 
tests of phase differences. When means and percentages were 
presented, they lacked an index of variability (SD) or
reliability (standard error or CIs). They were nonessential adjuncts
to visual analysis. Of the eight studies (11%) with some
quantitative analysis, six used the Student t test, one used a
repeated measures ANOVA, and one used a Friedman
two-way nonparametric ANOVA. None of these eight studies
included effect size CIs.

Score Distributions 

Figure 2 compares uniform probability (percentile)
distributions of PAND (upper triangles) and PND (lower circles). The
most prominent feature of this distribution graph is that both
PAND and PND tend to flatten out as their values approach
100%, reflecting low discriminability. For the most successful
interventions, PAND and PND distributions are quite
similar, but they vary greatly for less effective interventions. The
least effective interventions earned approximately PAND =
50%, which is chance-level overlap between phases, and
earned PND = 0-10%. The percentile distribution for PAND
was 10th: .62, 25th: .72, 50th: .84, 75th: .92, and 90th: .97.
The percentile distribution for PND was 10th: .19, 25th: .23,
50th: .50, 75th: .67, and 90th: .76.

Distribution of PAND’s effect size, Phi², is best
understood in comparison with the well-known OLS R². Figure 3
compares their respective distributions. Phi² scores are the
upper circles, and R² are the lower triangles. R² and Phi² have
similar distributions until values of about .60, where Phi²
becomes larger. The Figure 3 distribution graph reflects a
ceiling effect for Phi² at the top end of the scale, where several
scores are bunched. R² does not have this ceiling effect. The
percentile distribution for Phi² was 10th: .05, 25th: .22, 50th:
.53, 75th: .80, and 90th: .89. For R² the percentile distribution
was 10th: .08, 25th: .22, 50th: .50, 75th: .67, and 90th: .76.
Thus, Phi² yields somewhat more extreme scores than R²:
larger for the most effective interventions, and smaller for the
least effective.

FIGURE 2. Percentile distributions of PAND (upper
triangles) and PND (lower circles) for 75 published
multiple baseline designs.

FIGURE 3. Percentile distributions of Phi² scores
(upper circles), and R² (lower triangles) for 75
published multiple baseline designs.

Intercorrelations 

The new index PAND, and its effect size Phi², were
intercorrelated with PND and R² indices, resulting in Table 3. As
expected, PAND and Phi have the highest correlation, at R = .98.
PAND and Phi² also were highly correlated (R = .95). All
other indices bore at least high-moderate relationships. The
next strongest relationship (.90), between the two established
effect sizes, R² and Phi², shows they measure similar
attributes, though the first is a continuous and the second a
categorical measurement. PND agreed with PAND and Phi² at similar
levels, .85 and .84. The one lower relationship (R = .78),
between PND and R², was not unexpected, given that the
two have little in common either in theory or in method.

Statistical Power 

The relative merits of PAND and Phi² were next examined
with regard to the criterion of statistical power (i.e., the
ability to reliably detect effective interventions in typical
datasets). The award-winning NCSS (Hintze, 2004) Power
Analysis and Sample Size (PASS) power analysis module was
used to graph power analysis curves for Phi and for R² (see
Figure 4). The alpha (p level) was set at .05, and power was
set at 80%. The two power analysis curves show the minimum
number of MBD data points required to reliably detect a given
effect size level for Phi (upper triangles) and for R² (lower
circles). Figure 4 is best understood by noting N = 69, which was
the average MBD size for our sample of 75 datasets. At N =
69, regression (R²) reliably detects (at alpha = .05, and 80%
power) values lower than R² = .10, reflecting strong power.
Regression possesses ample power for nearly 90% of the
datasets, as the 10th percentile R² score for the 75 datasets
was .08. The Phi procedure, in reference to the same N = 69,
reliably detected values as small as .34. Phi’s 10th percentile
score was .22, and its 25th percentile score was .46. So Phi
reliably detected more than 75% of the dataset effect sizes,
but lacked the power to detect the smallest 10% to 15% of
them. Thus, Phi had satisfactory, but not excellent, power for
our sample.

In addition to statistical significance, Phi and Phi² offer 
a second related advantage over PND: confidence intervals. 
CIs are especially useful for judging the reliability of obtained 
effect sizes at moderate levels. The question was posed: What 
Phi CI widths are typical for published MBD data? To answer 
this question, exact, asymmetrical bootstrap CIs were obtained 
from proportions tests of five representative datasets, 
with Phi values at 90th, 75th, 50th, 25th, and 10th percentile 
levels. CI width depends mainly on the size of the Phi value 
and on the available sample. The data for these five datasets 
were all weighted to equal N = 69. The 95% confidence level 
CIs are shown bracketing obtained Phi scores: 90th percentile: 
[.79 < .94 < .99], 75th percentile: [.71 < .86 < .94], 50th percentile: 
[.47 < .68 < .82], 25th percentile: [.26 < .51 < .68], 
10th percentile: [-.02 < .22 < .44]. Three of these five CIs are 
narrow enough to indicate reasonable confidence in the Phi 
score. Even down at the 25th percentile, the interval of .26 to 
.68 gives us some confidence in the obtained Phi of .51. The 
exception is the CI for the low 10th percentile Phi = .22. That 
CI's width is about double the Phi value, and it drops below zero, giving us 
little or no confidence in Phi values this low. 
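
For readers without a proportions-test module, an asymmetrical interval of this kind can be sketched with a simple percentile bootstrap. This is an illustration under stated assumptions, not the software procedure used in the study: the cell counts below are hypothetical (chosen only so that N = 69), and the function names are invented for the example.

```python
import random
from math import sqrt


def phi_from_table(a, b, c, d):
    """Phi for the 2 x 2 table [[a, b], [c, d]]; 0.0 if any margin is empty."""
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return (a * d - b * c) / sqrt(denom) if denom else 0.0


def bootstrap_phi_ci(a, b, c, d, level=0.95, reps=2000, seed=1):
    """Percentile bootstrap CI for Phi: resample the N cases with replacement."""
    rng = random.Random(seed)
    cases = [0] * a + [1] * b + [2] * c + [3] * d   # one cell index per case
    n = len(cases)
    stats = []
    for _ in range(reps):
        counts = [0, 0, 0, 0]
        for cell in rng.choices(cases, k=n):
            counts[cell] += 1
        stats.append(phi_from_table(*counts))
    stats.sort()
    tail = (1 - level) / 2
    lower = stats[int(tail * reps)]
    upper = stats[int((1 - tail) * reps) - 1]
    return lower, phi_from_table(a, b, c, d), upper


# Hypothetical balanced table with N = 69 (not taken from the 75 datasets).
low, phi, high = bootstrap_phi_ci(29, 6, 6, 28)
print(f"[{low:.2f} < {phi:.2f} < {high:.2f}]")
```

The interval is printed in the same bracketed form used above. Because Phi is bounded at 1, intervals around large Phi values tend to be shorter on the upper side, which is why they are asymmetrical.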

Efficiency 

Although PND is described as a quick and easy procedure, 
that proved not to be the case with longer MBD datasets with 
high data variability. For those datasets, reliable calculations 
from a graph with pencil and ruler were not possible, necessitating 
use of the data spreadsheet. So for complex designs, 
use of the data spreadsheet was essential for both PND and 
PAND, making the two procedures equal in efficiency. For 
shorter and uncrowded visual displays, a transparent ruler permits 
accurate calculation from the graph. In those cases, calculation 
of PND and PAND is similar in efficiency. For extra 
computational effort, Phi can be calculated along with PAND. 
CI calculation requires the further step of a proportions test. 

TABLE 3. Intercorrelations Among Five Indices of Intervention Effect, Based on 75 Published Multiple Baseline Designs 

Approach    R²      PAND    Phi     Phi² 
R²          1.000 
PAND         .872   1.000 
Phi          .870    .978   1.000 
Phi²         .901    .945    .973   1.000 
PND          .780    .851    .844    .836 

FIGURE 4. Power analysis graph for Phi and R². 

THE JOURNAL OF SPECIAL EDUCATION VOL. 40/NO. 4/2007 201 

Discussion 

Accountability standards from funding agents and higher re- 
search and evaluation standards from government agencies 
are pressing researchers, including SCR practitioners, to pro- 
vide objective, reliable data. Reproducible methods for sum- 
marizing SCR data also are needed for meta-analyses. The 
evidence-based interventions movements in multiple fields 
call for objective and reliable judgments of intervention effectiveness. 
Within special education, the two most actively debated 
statistical indices have been very different ones: PND and 
the R² effect size, obtained from OLS regression, t tests, or 
ANOVA. The main weaknesses of PND are that it overemphasizes 
a single extreme phase A data point, has no known 
sampling distribution, has unknown reliability, and holds lit- 
tle currency within the broader research community. A pri- 
mary weakness of the regression approach is that single-case 
data too often fail to meet its parametric data assumptions. 

PAND, the percentage of all non-overlapping data, remedies 
some deficiencies of both PND and R². PAND is a non-overlapping 
data index, but also is closely related, via a 2 × 2 
table, to the respected Pearson’s Phi effect size, the “fourfold 
point correlation coefficient” (Cohen, 1988, p. 223). Phi is 
standard output from crosstabs analyses and from tests of two 
independent proportions. The latter statistical modules com- 
monly provide exact CIs for “the difference between propor- 
tions” (Cohen, 1988, p. 184), which for a balanced fourfold 
table is the same as Phi. 
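
The equivalence between Phi and the difference between proportions can be checked numerically. The sketch below reads "balanced" as a table whose off-diagonal cells are equal (b = c), so that row and column margins match; that reading, the cell counts, and the function names are assumptions made for illustration.

```python
from math import sqrt


def phi(a, b, c, d):
    """Pearson's Phi for the 2 x 2 table [[a, b], [c, d]]."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))


def proportion_difference(a, b, c, d):
    """Difference between the two row proportions, a/(a+b) - c/(c+d)."""
    return a / (a + b) - c / (c + d)


balanced = (6, 1, 1, 8)      # hypothetical counts with b == c
unbalanced = (6, 1, 3, 8)    # hypothetical counts with b != c

print(phi(*balanced), proportion_difference(*balanced))      # identical values
print(phi(*unbalanced), proportion_difference(*unbalanced))  # values differ
```

When b = c, the denominator of Phi reduces to (a + b)(c + d), which is the common denominator of the two proportions, so the two statistics coincide. With unequal off-diagonal cells they drift apart.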

The PAND/Phi procedure was applied to a sample of 75 
published multiple baseline designs. From these calculations, 
questions were answered regarding typical effect size magnitude, 
relationships to PND and R², statistical power, and time 
efficiency. One finding of this study was the high level of 
agreement between PND and both PAND/Phi indices, close 
to R = .85. However, against the OLS-based R², PND fared 
less well, at only R = .78, compared to .87 for the new measures. 
This finding is understandable, considering PND’s 
unique procedure of focusing on a single phase A data point. 
PAND and Phi can both be calculated by hand, based on ob- 
served data overlap, but only PAND includes all data equally, 
reflecting more closely the non-overlap interpretation given 
to R (Cohen, 1988). Considering its unique approach to mea- 
suring data overlap, PND was a surprisingly strong performer. 
These results do not discount the claim of its authors that PND 
works well for local decision making. Scruggs and Mastropieri 
(1998) contended that PND agrees well with visual judgment, 
and that claim seems reasonable. Both PND and PAND 
distribution graphs showed the weakness of a ceiling effect: 
leveling and clumping of scores near the top of the distribution. 
R² avoids this weakness by continuing to increase as the 
phase scores become more widely separated, whereas PND 
and PAND cannot increase beyond 100% non-overlap. 

Statistical power cannot be computed for PND, but it was 
for Phi, with promising results. For short, single-series datasets, 
Phi may lack power, but most of the multiple baseline designs 
sampled had 45 to 96 data points (IQR), reasonably balanced 
across phases. With these datasets, and with effect sizes of 
moderate magnitude, PAND/Phi possessed sufficient power 
for .05 alpha inferences. Of course, matched against R², PAND/ 
Phi’s power is less impressive. However, significance tests 
and CIs are seldom needed for the weakest effect sizes. 

From calculations over the 75 datasets, PAND and PND 
proved approximately equal in efficiency. For shorter datasets, 
both could be calculated from a graph with the assistance of 
a transparent ruler. For more complex and longer datasets and 
for crowded graphs, both procedures require visual scrutiny 
of the source data columns. Calculating confidence intervals 
for Phi requires an additional step. The additional CI information 
is highly desirable for publications, meta-analyses, and 
federal research grant-proposal writing. It is less needed for 
local decision making. 

The R², Hedges’ g, and Cohen’s d OLS effect sizes, 
calculated from regression, t tests, or ANOVA, continue to 
be the most powerful analyses for multiple baseline designs. 
The major limitation to all three is that SCR data often fail to 
meet required parametric data assumptions. Failure to meet 
data assumptions can sometimes be addressed by data trans- 
formations or resampling/bootstrapping methodologies (Davi- 
son & Hinkley, 1997; Good, 2001). The second type of solution 
is to turn to nonparametric techniques, such as the categorical 
Phi summary, which entail few data assumptions. For visual 
analysts, PAND/Phi has the advantage of reflecting non-over- 
lapping data between baseline and intervention phases. Re- 
quiring similar effort to the competing PND, PAND offers 
what PND cannot: (a) acceptance within the broader research 
community and (b) p values and confidence intervals to indi- 
cate reliability. 
Appendix 


Procedure for Calculating PAND/Phi from MBD 


Datafile Setup 

The PAND and Phi sorting/crosstabs analysis is best accomplished 
within a statistics package, but also can be done in 
Microsoft Excel. Five variable columns are created: Random, 
Series, ABPhase, Score, and Sorted. Into Random is pasted a 
set of random numbers. Series contains a different categorical 
tag for each series (e.g., I, II, III, IV). ABPhase is dichotomous, 
containing categorical tags for the two types of 
phases (A, B). Score contains original scores from all series. 
Sorted is an empty column in the spreadsheet, where results 
from a nested sort are later pasted. The data are entered in a 
tall vertical column, with series under one another. 
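
As an illustration of this layout outside a spreadsheet, the sketch below builds the same tall file as a list of rows with the five columns named above. The helper name `build_datafile`, the input format, and the scores are hypothetical.

```python
import random


def build_datafile(series_scores, seed=0):
    """Stack all series into one tall file, A phase before B phase in each.

    series_scores maps a series tag to {"A": [...], "B": [...]} and is read
    in the order given, so list the series in ascending order.
    """
    rng = random.Random(seed)
    rows = []
    for series, phases in series_scores.items():
        for phase in ("A", "B"):
            for score in phases[phase]:
                rows.append({
                    "Random": rng.random(),   # used later to shuffle the file
                    "Series": series,
                    "ABPhase": phase,
                    "Score": score,
                    "Sorted": None,           # left empty until the paste step
                })
    return rows


# Hypothetical two-series design.
datafile = build_datafile({
    "I": {"A": [2, 3, 5, 6], "B": [4, 7, 8, 9, 10]},
    "II": {"A": [1, 2, 3], "B": [5, 6, 7, 8]},
})
print(len(datafile), datafile[0])
```

Each row corresponds to one data point, and the Sorted column stays empty until the copied ABPhase labels are pasted in.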

Procedure for Calculating PAND 

1. Copy ABPhase: First, ensure the datafile is 
properly set up, with Time ascending (1, 2, 3, 
etc.), Series ascending (I, II, III), and ABPhase 
ascending (A, B) for each Series. When the file 
is properly set up, copy contents of ABPhase, 
and hold it in computer memory. 

2. Randomize: Sort the entire dataset by the Random 
column. 

3. Nested Sort: Sort Score within Series. If scores 
are expected to improve, then both variables 
are sorted normally, ascending. However, if 
Scores are expected to decrease across phases, 
then the nested Score is sorted inversely (de- 
scending). 

4. Paste the ABPhase data being held in memory 
(copied in Step 1) into the empty Sorted column. 

5. Conduct a Crosstabs analysis on the ABPhase 
and Sorted columns. Output will include the 2 × 2 
table, as well as the Phi statistic. For confidence 
intervals around Phi, analyze the table’s 
contents by a statistical module for testing two 
independent proportions. 
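
The five steps above can be sketched in a few lines of code for readers who prefer a script to a spreadsheet. This is one possible reading of the procedure, not the authors' software: rows are (series, phase, score) tuples, series are given integer ids so that they sort in order, and the function name and example scores are hypothetical.

```python
import random
from math import sqrt


def pand_phi_table(rows, increasing=True, seed=0):
    """Return the 2 x 2 table (a, b, c, d) and Phi for (series, phase, score) rows."""
    # Step 1: order the file (series, then A before B) and copy ABPhase.
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    ab_copy = [phase for _, phase, _ in ordered]
    # Step 2: randomize row order so tied scores are not ordered by phase.
    shuffled = ordered[:]
    random.Random(seed).shuffle(shuffled)
    # Step 3: nested sort of Score within Series (descending if scores should drop).
    sign = 1 if increasing else -1
    nested = sorted(shuffled, key=lambda r: (r[0], sign * r[2]))
    # Step 4: paste the copied ABPhase labels alongside the sorted rows.
    # Step 5: crosstab actual phase (rows) against the pasted label (columns).
    cells = {("A", "A"): 0, ("A", "B"): 0, ("B", "A"): 0, ("B", "B"): 0}
    for (_, actual, _), pasted in zip(nested, ab_copy):
        cells[(actual, pasted)] += 1
    a, b = cells[("A", "A")], cells[("A", "B")]
    c, d = cells[("B", "A")], cells[("B", "B")]
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    phi = (a * d - b * c) / sqrt(denom) if denom else 0.0
    return (a, b, c, d), phi


# Hypothetical two-series design in which scores are expected to increase.
rows = ([(1, "A", s) for s in (2, 3, 5, 6)] + [(1, "B", s) for s in (4, 7, 8, 9, 10)]
        + [(2, "A", s) for s in (1, 2, 3)] + [(2, "B", s) for s in (5, 6, 7, 8)])
table, phi = pand_phi_table(rows)
print(table, round(phi, 3))
```

Because the pasted labels reproduce each series' original count of A and B points, the two off-diagonal cells come out equal, which gives the balanced table needed for the proportions test. Python's sort is stable, so the shuffle in Step 2 is what decides the order of tied scores.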


